# 10. Summary
![REINFORCE update intuition](img/screen-shot-2018-07-17-at-4.44.10-pm.png)
REINFORCE increases the probability of "good" actions and decreases the probability of "bad" actions. (Source)
### What are Policy Gradient Methods?
- Policy-based methods are a class of algorithms that search directly for the optimal policy, without simultaneously maintaining value function estimates.
- Policy gradient methods are a subclass of policy-based methods that estimate the weights of an optimal policy through gradient ascent.
- In this lesson, we represent the policy with a neural network, where our goal is to find the weights \theta of the network that maximize expected return.
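To make this concrete, the sketch below shows a minimal stochastic policy network in PyTorch. The layer sizes, the two-action discrete output, and the `act` helper are illustrative assumptions rather than the lesson's exact architecture; the important pieces are the softmax output \pi_\theta(a|s) and the stored log-probability, which the gradient estimate needs later.

```python
# Minimal sketch of a stochastic policy network (assumed architecture, not the
# lesson's exact code). The forward pass outputs action probabilities
# pi_theta(a|s); act() samples an action and keeps its log-probability.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Policy(nn.Module):
    def __init__(self, state_size=4, hidden_size=16, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return F.softmax(self.fc2(x), dim=1)      # action probabilities

    def act(self, state):
        # state is assumed to be a NumPy array from a Gym-style environment.
        state = torch.from_numpy(state).float().unsqueeze(0)
        probs = self.forward(state)
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)  # log pi_theta(a|s)
```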
### The Big Picture
- The policy gradient method will iteratively amend the policy network weights to:
  - make (state, action) pairs that resulted in positive return more likely, and
  - make (state, action) pairs that resulted in negative return less likely.
### Problem Setup
- A trajectory \tau is a state-action sequence s_0, a_0, \ldots, s_H, a_H, s_{H+1}.
- In this lesson, we will use the notation R(\tau) to refer to the return corresponding to trajectory \tau.
- Our goal is to find the weights \theta of the policy network that maximize the expected return U(\theta) := \sum_\tau \mathbb{P}(\tau;\theta)R(\tau).
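In code, the return R(\tau) of one collected trajectory is just a (possibly discounted) sum of its rewards, and U(\theta) is estimated by averaging R(\tau) over sampled trajectories. A small sketch, assuming `rewards` holds the rewards gathered along one trajectory:

```python
# Sketch: return of a single trajectory, assuming `rewards` is the list of
# rewards collected while following the policy for one episode.
def trajectory_return(rewards, gamma=1.0):
    # gamma = 1.0 matches the undiscounted R(tau) used in this summary;
    # a discount factor gamma < 1 would down-weight later rewards.
    return sum(gamma**t * r for t, r in enumerate(rewards))

# U(theta) = E[R(tau)] is estimated in practice by averaging
# trajectory_return(...) over the m trajectories sampled with pi_theta.
```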
### REINFORCE
- The pseudocode for REINFORCE is as follows:
  1. Use the policy \pi_\theta to collect m trajectories \{\tau^{(1)}, \tau^{(2)}, \ldots, \tau^{(m)}\} with horizon H. We refer to the i-th trajectory as \tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, s_{H+1}^{(i)}).
  2. Use the trajectories to estimate the gradient \nabla_\theta U(\theta): \nabla_\theta U(\theta) \approx \hat{g} := \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) R(\tau^{(i)})
  3. Update the weights of the policy: \theta \leftarrow \theta + \alpha \hat{g}
  4. Loop over steps 1-3.
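The pseudocode maps directly onto a short training loop. The sketch below is an assumed minimal implementation, not the lesson's exact code: it reuses the `Policy.act` interface from the earlier snippet, assumes a classic Gym-style `env` with a discrete action space, and uses m = 1 trajectory per update. Instead of forming \hat{g} explicitly, it builds the surrogate loss -\sum_t \log \pi_\theta(a_t|s_t) R(\tau) and lets autograd produce the same gradient.

```python
# Sketch of the REINFORCE loop (m = 1 trajectory per update), assuming the
# Policy class above and a classic Gym-style environment `env`.
import torch
import torch.optim as optim

def reinforce(env, policy, n_episodes=1000, max_t=1000, gamma=1.0, lr=1e-2):
    # Plain SGD on the surrogate loss performs theta <- theta + alpha * g_hat;
    # Adam is a common substitute in practice.
    optimizer = optim.SGD(policy.parameters(), lr=lr)
    for episode in range(n_episodes):
        log_probs, rewards = [], []
        state = env.reset()
        for t in range(max_t):                      # step 1: collect a trajectory
            action, log_prob = policy.act(state)
            state, reward, done, _ = env.step(action)
            log_probs.append(log_prob)
            rewards.append(reward)
            if done:
                break

        R = sum(gamma**t * r for t, r in enumerate(rewards))   # R(tau)

        # step 2: surrogate loss whose gradient is -g_hat for this trajectory
        loss = -torch.cat(log_probs).sum() * R

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                            # step 3: gradient ascent on U(theta)
    return policy
```

Averaging over m > 1 trajectories is a straightforward extension: accumulate one surrogate loss per trajectory and divide by m before calling `backward()`.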
### Derivation
- We derived the likelihood ratio policy gradient: \nabla_\theta U(\theta) = \sum_\tau \mathbb{P}(\tau;\theta)\nabla_\theta \log \mathbb{P}(\tau;\theta)R(\tau).
- We can approximate this gradient with a sample-based estimate, averaging over the m collected trajectories: \nabla_\theta U(\theta) \approx \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta)R(\tau^{(i)}).
- We calculated the following: \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta (a_t^{(i)}|s_t^{(i)}).
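Spelled out, the three bullets chain together as below. The second block expands \mathbb{P}(\tau;\theta) into the initial-state distribution (written here as \mu(s_0), a symbol not used elsewhere in this summary) times the environment dynamics times the policy; only the policy terms depend on \theta, which is why the dynamics drop out of the gradient.

```latex
\nabla_\theta U(\theta)
  = \nabla_\theta \sum_\tau \mathbb{P}(\tau;\theta) R(\tau)
  = \sum_\tau \mathbb{P}(\tau;\theta)\, \nabla_\theta \log \mathbb{P}(\tau;\theta)\, R(\tau)
  \quad \text{(likelihood ratio trick: } \nabla_\theta \mathbb{P} = \mathbb{P}\, \nabla_\theta \log \mathbb{P} \text{)}

\log \mathbb{P}(\tau^{(i)};\theta)
  = \log \mu(s_0^{(i)})
  + \sum_{t=0}^{H} \log \mathbb{P}(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)})
  + \sum_{t=0}^{H} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})

\nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta)
  = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})
  \quad \text{(the } \mu \text{ and dynamics terms do not depend on } \theta \text{)}
```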
### What's Next?
- REINFORCE can solve Markov Decision Processes (MDPs) with either discrete or continuous action spaces.
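For a continuous action space, the only structural change to REINFORCE is the distribution the policy network outputs: instead of a softmax over discrete actions, a common choice (an assumption here, not something prescribed by this summary) is a Gaussian whose mean comes from the network, with a learned standard deviation. Everything else, collecting trajectories, weighting log-probabilities by R(\tau), and ascending the gradient, stays the same.

```python
# Sketch of a Gaussian policy head for continuous actions; the layer sizes and
# action dimension are illustrative assumptions, not taken from the lesson.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, state_size=8, hidden_size=32, action_size=2):
        super().__init__()
        self.fc = nn.Linear(state_size, hidden_size)
        self.mu = nn.Linear(hidden_size, action_size)          # mean of the Gaussian
        self.log_std = nn.Parameter(torch.zeros(action_size))  # learned log std-dev

    def act(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0)
        x = F.relu(self.fc(state))
        dist = Normal(self.mu(x), self.log_std.exp())
        action = dist.sample()
        # Sum over action dimensions to get log pi_theta(a|s); the REINFORCE
        # update then uses this exactly as in the discrete case.
        return action.squeeze(0).numpy(), dist.log_prob(action).sum()
```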